Computation and Language 71
☆ Self-Training with Direct Preference Optimization Improves Chain-of-Thought Reasoning ACL 2024
Effective training of language models (LMs) for mathematical reasoning tasks
demands high-quality supervised fine-tuning data. Besides obtaining annotations
from human experts, a common alternative is sampling from larger and more
powerful LMs. However, this knowledge distillation approach can be costly and
unstable, particularly when relying on closed-source, proprietary LMs like
GPT-4, whose behaviors are often unpredictable. In this work, we demonstrate
that the reasoning abilities of small-scale LMs can be enhanced through
self-training, a process where models learn from their own outputs. We also
show that conventional self-training can be further augmented by a
preference learning algorithm called Direct Preference Optimization (DPO). By
integrating DPO into self-training, we leverage preference data to guide LMs
towards more accurate and diverse chain-of-thought reasoning. We evaluate our
method across various mathematical reasoning tasks using different base models.
Our experiments show that this approach not only improves LMs' reasoning
performance but also offers a more cost-effective and scalable solution
compared to relying on large proprietary LMs.
comment: ACL 2024. Code and data are available at
https://github.com/TianduoWang/DPO-ST
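The DPO objective mentioned in the abstract can be illustrated on a single preference pair. This is a generic sketch of the standard DPO loss, not the paper's training code; the log-probabilities and the β value below are made-up numbers:

```python
import math

# Sketch of the DPO loss on one preference pair, given log-probs of the chosen
# (y_w) and rejected (y_l) responses under the policy and a frozen reference.
def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    # Implicit reward margin between chosen and rejected responses
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -math.log(1 / (1 + math.exp(-margin)))  # -log sigmoid(margin)

# In DPO-augmented self-training, y_w / y_l would be correct vs. incorrect
# chain-of-thought samples drawn from the model's own outputs.
low = dpo_loss(-10.0, -12.0, -11.0, -11.0)   # policy already prefers y_w
high = dpo_loss(-12.0, -10.0, -11.0, -11.0)  # policy prefers y_l: larger loss
```

Minimizing this loss pushes the policy to assign relatively more probability to the preferred (here: correct) reasoning chains.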
☆ LoRA-Pro: Are Low-Rank Adapters Properly Optimized?
Low-Rank Adaptation, also known as LoRA, has emerged as a prominent method
for parameter-efficient fine-tuning of foundation models by re-parameterizing
the original matrix into the product of two low-rank matrices. Despite its
efficiency, LoRA often yields inferior performance compared to full
fine-tuning. In this paper, we propose LoRA-Pro to bridge this performance gap.
Firstly, we delve into the optimization processes in LoRA and full fine-tuning.
We reveal that while LoRA employs low-rank approximation, it neglects to
approximate the optimization process of full fine-tuning. To address this, we
introduce a novel concept called the "equivalent gradient." This virtual
gradient makes the optimization process on the original matrix equivalent to
that of LoRA, and it can be used to quantify the differences between LoRA
and full fine-tuning. The equivalent gradient is derived from the gradients of
matrices $A$ and $B$. To narrow the performance gap, our approach minimizes the
differences between the equivalent gradient and the gradient obtained from full
fine-tuning during the optimization process. By solving this objective, we
derive optimal closed-form solutions for updating matrices $A$ and $B$. Our
method constrains the optimization process, shrinking the performance gap
between LoRA and full fine-tuning. Extensive experiments on natural language
processing tasks validate the effectiveness of our method.
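The "equivalent gradient" idea can be checked numerically: for $W = W_0 + BA$, one step of gradient descent on $A$ and $B$ changes $W$ (to first order) as if $W$ were updated with a virtual gradient built from the factor gradients. A minimal NumPy sketch under these assumptions (random matrices, no LoRA scaling factor):

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, r = 8, 6, 2
B = rng.normal(size=(d, r))    # LoRA factor B (d x r)
A = rng.normal(size=(r, k))    # LoRA factor A (r x k)
g_W = rng.normal(size=(d, k))  # full fine-tuning gradient on W

# Chain-rule gradients on the low-rank factors for W = W0 + B @ A
g_A = B.T @ g_W
g_B = g_W @ A.T

# First-order change in W after one SGD step on A and B with learning rate eta
eta = 1e-3
W_change = (B - eta * g_B) @ (A - eta * g_A) - B @ A

# "Equivalent gradient": the virtual gradient on W that vanilla LoRA follows
g_equiv = g_B @ A + B @ g_A

# LoRA's update on W tracks -eta * g_equiv, not -eta * g_W (up to O(eta^2))
assert np.allclose(W_change, -eta * g_equiv, atol=1e-3)

# The gap LoRA-Pro seeks to shrink: distance between g_equiv and g_W
gap = np.linalg.norm(g_equiv - g_W)
```

LoRA-Pro's closed-form updates then choose the factor updates so that this gap is minimized at every step.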
☆ Recursive Introspection: Teaching Language Model Agents How to Self-Improve
A central piece in enabling intelligent agentic behavior in foundation models
is to make them capable of introspecting upon their behavior, reasoning, and
correcting their mistakes as more computation or interaction is available. Even
the strongest proprietary large language models (LLMs) do not quite exhibit the
ability to continually improve their responses sequentially, even in
scenarios where they are explicitly told that they are making a mistake. In
this paper, we develop RISE: Recursive IntroSpEction, an approach for
fine-tuning LLMs to introduce this capability, despite prior work hypothesizing
that this capability may not be possible to attain. Our approach prescribes an
iterative fine-tuning procedure, which attempts to teach the model how to alter
its response after having executed previously unsuccessful attempts to solve a
hard test-time problem, optionally with additional environment feedback. RISE
poses fine-tuning for a single-turn prompt as solving a multi-turn Markov
decision process (MDP), where the initial state is the prompt. Inspired by
principles in online imitation learning and reinforcement learning, we propose
strategies for multi-turn data collection and training so as to imbue an LLM
with the capability to recursively detect and correct its previous mistakes in
subsequent iterations. Our experiments show that RISE enables Llama2, Llama3,
and Mistral models to improve themselves with more turns on math reasoning
tasks, outperforming several single-turn strategies given an equal amount of
inference-time computation. We also find that RISE scales well, often attaining
larger benefits with more capable models. Our analysis shows that RISE makes
meaningful improvements to responses to arrive at the correct solution for
challenging prompts, without disrupting one-turn abilities as a result of
expressing more complex distributions.
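The multi-turn MDP framing can be sketched as a retry loop: the state is the conversation so far, each action is a revised answer, and environment feedback tells the model when it erred. This is a hypothetical illustration of RISE-style rollout collection with toy stand-ins, not the paper's implementation:

```python
# Hypothetical sketch of a RISE-style multi-turn rollout: the model retries a
# problem, each turn conditioned on its previous failed attempts plus feedback.
def rollout(model, verifier, prompt, max_turns=3):
    history = [("user", prompt)]          # initial MDP state is the prompt
    trajectory = []
    for _ in range(max_turns):
        answer = model(history)           # action: propose a (revised) answer
        ok = verifier(answer)             # environment feedback
        trajectory.append((list(history), answer, ok))
        if ok:
            break
        history += [("assistant", answer),
                    ("user", "Your answer was incorrect. Please try again.")]
    return trajectory

# Toy stand-ins: a "model" that improves once it sees feedback, and an
# exact-match checker standing in for the environment.
def toy_model(history):
    return "42" if len(history) > 1 else "41"

traj = rollout(toy_model, lambda a: a == "42", "What is 6 * 7?")
```

The collected trajectories (states, actions, success flags) would then feed the iterative fine-tuning procedure described above.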
☆ Exploring Scaling Trends in LLM Robustness
Nikolhaus Howe, Michał Zajac, Ian McKenzie, Oskar Hollinsworth, Tom Tseng, Pierre-Luc Bacon, Adam Gleave
Language model capabilities predictably improve from scaling a model's size
and training data. Motivated by this, increasingly large language models have
been trained, yielding an array of impressive capabilities. Yet these models
are vulnerable to adversarial prompts, such as "jailbreaks" that hijack models
to perform undesired behaviors, posing a significant risk of misuse. Prior work
indicates that computer vision models become more robust with model and data
scaling, raising the question: does language model robustness also improve with
scale? We study this question empirically, finding that larger models respond
substantially better to adversarial training, but there is little to no benefit
from model scale in the absence of explicit defenses.
comment: 31 pages
☆ The FIGNEWS Shared Task on News Media Narratives ACL 2024
Wajdi Zaghouani, Mustafa Jarrar, Nizar Habash, Houda Bouamor, Imed Zitouni, Mona Diab, Samhaa R. El-Beltagy, Muhammed AbuOdeh
We present an overview of the FIGNEWS shared task, organized as part of the
ArabicNLP 2024 conference co-located with ACL 2024. The shared task addresses
bias and propaganda annotation in multilingual news posts. We focus on the
early days of the Israel War on Gaza as a case study. The task aims to foster
collaboration in developing annotation guidelines for subjective tasks by
creating frameworks for analyzing diverse narratives highlighting potential
bias and propaganda. In a spirit of fostering and encouraging diversity, we
address the problem from a multilingual perspective, namely within five
languages: English, French, Arabic, Hebrew, and Hindi. A total of 17 teams
participated in two annotation subtasks: bias (16 teams) and propaganda (6
teams). The teams competed in four evaluation tracks: guidelines development,
annotation quality, annotation quantity, and consistency. Collectively, the
teams produced 129,800 data points. Key findings and implications for the field
are discussed.
comment: 18 pages, 10 tables, 1 figure, accepted to ArabicNLP 2024 co-located
with ACL 2024
☆ Dallah: A Dialect-Aware Multimodal Large Language Model for Arabic
Recent advancements have significantly enhanced the capabilities of
Multimodal Large Language Models (MLLMs) in generating and understanding
image-to-text content. Despite these successes, progress is predominantly
limited to English due to the scarcity of high-quality multimodal resources in
other languages. This limitation impedes the development of competitive models
in languages such as Arabic. To alleviate this situation, we introduce an
efficient Arabic multimodal assistant, dubbed Dallah, that utilizes an advanced
language model based on LLaMA-2 to facilitate multimodal interactions. Dallah
demonstrates state-of-the-art performance in Arabic MLLMs. Through fine-tuning
on six Arabic dialects, Dallah showcases its capability to handle complex
dialectal interactions incorporating both textual and visual elements. The
model excels in two benchmark tests: one evaluating its performance on Modern
Standard Arabic (MSA) and another specifically designed to assess dialectal
responses. Beyond its robust performance in multimodal interaction tasks,
Dallah has the potential to pave the way for further development of
dialect-aware Arabic MLLMs.
☆ Tracking linguistic information in transformer-based sentence embeddings through targeted sparsification RepL4NLP 2024
Analyses of transformer-based models have shown that they encode a variety of
linguistic information from their textual input. While these analyses have
shed light on the relation between linguistic information on one side, and
internal architecture and parameters on the other, a question remains
unanswered: how is this linguistic information reflected in sentence
embeddings? Using datasets consisting of sentences with known structure, we
test to what degree information about chunks (in particular noun, verb or
prepositional phrases), such as grammatical number, or semantic role, can be
localized in sentence embeddings. Our results show that such information is not
distributed over the entire sentence embedding, but rather it is encoded in
specific regions. Understanding how the information from an input text is
compressed into sentence embeddings helps us understand current transformer
models and build future explainable neural models.
comment: 12 pages, 9 figures, 1 table, published in RepL4NLP 2024
☆ PEFT-U: Parameter-Efficient Fine-Tuning for User Personalization
The recent emergence of Large Language Models (LLMs) has heralded a new era
of human-AI interaction. These sophisticated models, exemplified by Chat-GPT
and its successors, have exhibited remarkable capabilities in language
understanding. However, as these LLMs have undergone exponential growth, a
crucial dimension that remains understudied is the personalization of these
models. Large foundation models such as GPT-3 focus on creating a
universal model that serves a broad range of tasks and users. This approach
emphasizes the model's generalization capabilities, treating users as a
collective rather than as distinct individuals. While practical for many common
applications, this one-size-fits-all approach often fails to address the rich
tapestry of human diversity and individual needs. To explore this issue we
introduce the PEFT-U Benchmark: a new dataset for building and evaluating NLP
models for user personalization. PEFT-U consists of a series of
user-centered tasks containing diverse and individualized expressions where the
preferences of users can potentially differ for the same input. Using PEFT-U,
we explore the challenge of efficiently personalizing LLMs to accommodate
user-specific preferences in the context of diverse user-centered tasks.
☆ Difficulty Estimation and Simplification of French Text Using LLMs
We leverage generative large language models for language learning
applications, focusing on estimating the difficulty of foreign language texts
and simplifying them to lower difficulty levels. We frame both tasks as
prediction problems and develop a difficulty classification model using labeled
examples, transfer learning, and large language models, demonstrating superior
accuracy compared to previous approaches. For simplification, we evaluate the
trade-off between simplification quality and meaning preservation, comparing
zero-shot and fine-tuned performances of large language models. We show that
meaningful text simplifications can be obtained with limited fine-tuning. Our
experiments are conducted on French texts, but our methods are
language-agnostic and directly applicable to other foreign languages.
comment: 10 pages, 4 figures
☆ I can listen but cannot read: An evaluation of two-tower multimodal systems for instrument recognition
Music two-tower multimodal systems integrate audio and text modalities into a
joint audio-text space, enabling direct comparison between songs and their
corresponding labels. These systems enable new approaches for classification
and retrieval, leveraging both modalities. Despite the promising results they
have shown for zero-shot classification and retrieval tasks, closer inspection
of the embeddings is needed. This paper evaluates the inherent zero-shot
properties of joint audio-text spaces for the case-study of instrument
recognition. We present an evaluation and analysis of two-tower systems for
zero-shot instrument recognition and a detailed analysis of the properties of
the pre-joint and joint embedding spaces. Our findings suggest that audio
encoders alone demonstrate good quality, while challenges remain within the
text encoder or joint space projection. Specifically, two-tower systems exhibit
sensitivity towards specific words, favoring generic prompts over musically
informed ones. Despite the large size of textual encoders, they do not yet
leverage additional textual context or infer instruments accurately from their
descriptions. Lastly, a novel approach for quantifying the semantic
meaningfulness of the textual space leveraging an instrument ontology is
proposed. This method reveals deficiencies in the systems' understanding of
instruments and provides evidence of the need for fine-tuning text encoders on
musical data.
comment: Accepted to ISMIR 2024
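Zero-shot classification in a joint audio-text space boils down to comparing one audio embedding against one text embedding per label. A toy sketch with made-up two-dimensional embeddings standing in for the two towers' encoder outputs:

```python
import numpy as np

# Toy zero-shot classification in a joint audio-text space: embed the audio,
# embed one text prompt per label, pick the label with highest cosine similarity.
def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

def zero_shot(audio_emb, label_embs):
    return max(label_embs, key=lambda name: cosine(audio_emb, label_embs[name]))

# Stand-in embeddings; a real system would encode prompts like
# "the sound of a guitar" with the text tower.
label_embs = {"guitar": np.array([1.0, 0.1]),
              "piano": np.array([0.1, 1.0])}
pred = zero_shot(np.array([0.9, 0.2]), label_embs)
```

The paper's finding is precisely that the text side of this comparison is the weak link: the prediction is sensitive to how the label prompt is worded.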
☆ RestoreAgent: Autonomous Image Restoration Agent via Multimodal Large Language Models
Haoyu Chen, Wenbo Li, Jinjin Gu, Jingjing Ren, Sixiang Chen, Tian Ye, Renjing Pei, Kaiwen Zhou, Fenglong Song, Lei Zhu
Natural images captured by mobile devices often suffer from multiple types of
degradation, such as noise, blur, and low light. Traditional image restoration
methods require manual selection of specific tasks, algorithms, and execution
sequences, which is time-consuming and may yield suboptimal results. All-in-one
models, though capable of handling multiple tasks, typically support only a
limited range and often produce overly smooth, low-fidelity outcomes due to
their broad data distribution fitting. To address these challenges, we first
define a new pipeline for restoring images with multiple degradations, and then
introduce RestoreAgent, an intelligent image restoration system leveraging
multimodal large language models. RestoreAgent autonomously assesses the type
and extent of degradation in input images and performs restoration through (1)
determining the appropriate restoration tasks, (2) optimizing the task
sequence, (3) selecting the most suitable models, and (4) executing the
restoration. Experimental results demonstrate the superior performance of
RestoreAgent in handling complex degradation, surpassing human experts.
Furthermore, the system's modular design facilitates the fast integration of new
tasks and models, enhancing its flexibility and scalability for various
applications.
☆ GermanPartiesQA: Benchmarking Commercial Large Language Models for Political Bias and Sycophancy
LLMs are changing the way humans create and interact with content,
potentially affecting citizens' political opinions and voting decisions. As
LLMs increasingly shape our digital information ecosystems, auditing to
evaluate biases, sycophancy, or steerability has emerged as an active field of
research. In this paper, we evaluate and compare the alignment of six LLMs by
OpenAI, Anthropic, and Cohere with German party positions and evaluate
sycophancy based on a prompt experiment. We contribute to evaluating political
bias and sycophancy in multi-party systems across major commercial LLMs. First,
we develop the benchmark dataset GermanPartiesQA based on the Voting Advice
Application Wahl-o-Mat, covering 10 state elections and one national election
between 2021 and 2023. In our study, we find a left-green tendency across all
examined LLMs.
We then conduct our prompt experiment for which we use the benchmark and
sociodemographic data of leading German parliamentarians to evaluate changes in
LLM responses. To differentiate between sycophancy and steerability, we use 'I
am [politician X], ...' and 'You are [politician X], ...' prompts. Against our
expectations, we do not observe notable differences between prompting 'I am'
and 'You are'. While our findings underscore that LLM responses can be
ideologically steered with political personas, they suggest that observed
changes in LLM outputs could be better described as personalization to the
given context rather than sycophancy.
comment: 12 pages
☆ Keep the Cost Down: A Review on Methods to Optimize LLM's KV-Cache Consumption
Large Language Models (LLMs), epitomized by ChatGPT's release in late 2022,
have revolutionized various industries with their advanced language
comprehension. However, their efficiency is challenged by the Transformer
architecture's struggle with handling long texts. KV-Cache has emerged as a
pivotal solution to this issue, converting the time complexity of token
generation from quadratic to linear, albeit with increased GPU memory overhead
proportional to conversation length. With the development of the LLM community
and academia, various KV-Cache compression methods have been proposed. In this
review, we dissect the various properties of KV-Cache and elaborate on various
methods currently used to optimize the KV-Cache space usage of LLMs. These
methods span the pre-training phase, deployment phase, and inference phase, and
we summarize the commonalities and differences among these methods.
Additionally, we list some metrics for evaluating the long-text capabilities of
large language models, from both efficiency and capability perspectives. Our
review thus sheds light on the evolving landscape of LLM optimization, offering
insights into future advancements in this dynamic field.
comment: to be published in CoLM 2024
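The cost trade-off the review centers on can be made concrete: with a KV-cache, each decode step attends over cached keys/values in O(t) work instead of recomputing the whole prefix, at the price of a cache that grows with conversation length. A single-head toy sketch (identity projections and random inputs, purely for illustration):

```python
import numpy as np

# Illustrative single-head attention decode step with a KV-cache: each new
# token attends over cached keys/values instead of reprocessing the prefix.
d = 4
rng = np.random.default_rng(0)
k_cache, v_cache = [], []

def decode_step(x):
    """Append this token's K/V to the cache, then attend over the whole cache."""
    k_cache.append(x)            # toy projections: identity for brevity
    v_cache.append(x)
    K = np.stack(k_cache)        # (t, d): memory grows linearly with length
    V = np.stack(v_cache)
    scores = K @ x / np.sqrt(d)  # O(t) work per generated token
    w = np.exp(scores - scores.max())
    w /= w.sum()                 # softmax over cached positions
    return w @ V

outs = [decode_step(rng.normal(size=d)) for _ in range(5)]
# GPU memory scales with len(k_cache); the compression methods surveyed in the
# review all aim to shrink this cache in one phase or another.
```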
☆ On the Effect of Purely Synthetic Training Data for Different Automatic Speech Recognition Architectures
In this work we evaluate the utility of synthetic data for training automatic
speech recognition (ASR). We use the ASR training data to train a
text-to-speech (TTS) system similar to FastSpeech-2. With this TTS we reproduce
the original training data, training ASR systems solely on synthetic data. For
ASR, we use three different architectures, attention-based encoder-decoder,
hybrid deep neural network hidden Markov model and a Gaussian mixture hidden
Markov model, showing the different sensitivity of the models to synthetic data
generation. In order to extend previous work, we present a number of ablation
studies on the effectiveness of synthetic vs. real training data for ASR. In
particular we focus on how the gap between training on synthetic and real data
changes by varying the speaker embedding or by scaling the model size. For the
latter we show that the TTS models generalize well, even when training scores
indicate overfitting.
comment: Accepted at the SynData4GenAI 2024 workshop
☆ What does Kiki look like? Cross-modal associations between speech sounds and visual shapes in vision-and-language models
Humans have clear cross-modal preferences when matching certain novel words
to visual shapes. Evidence suggests that these preferences play a prominent
role in our linguistic processing, language learning, and the origins of
signal-meaning mappings. With the rise of multimodal models in AI, such as
vision-and-language models (VLMs), it becomes increasingly important to uncover
the kinds of visio-linguistic associations these models encode and whether they
align with human representations. Informed by experiments with humans, we probe
and compare four VLMs for a well-known human cross-modal preference, the
bouba-kiki effect. We do not find conclusive evidence for this effect but
suggest that results may depend on features of the models, such as architecture
design, model size, and training details. Our findings inform discussions on
the origins of the bouba-kiki effect in human cognition and future developments
of VLMs that align well with human cross-modal associations.
comment: Appeared at the 13th edition of the Workshop on Cognitive Modeling
and Computational Linguistics (CMCL 2024)
☆ The Curious Case of Representational Alignment: Unravelling Visio-Linguistic Tasks in Emergent Communication
Natural language has the universal properties of being compositional and
grounded in reality. The emergence of linguistic properties is often
investigated through simulations of emergent communication in referential
games. However, these experiments have yielded mixed results compared to
similar experiments addressing linguistic properties of human language. Here we
address representational alignment as a potential contributing factor to these
results. Specifically, we assess the representational alignment between agent
image representations and between agent representations and input images. Doing
so, we confirm that the emergent language does not appear to encode human-like
conceptual visual features, since agent image representations drift away from
inputs whilst inter-agent alignment increases. We moreover identify a strong
relationship between inter-agent alignment and topographic similarity, a common
metric for compositionality, and address its consequences. To address these
issues, we introduce an alignment penalty that prevents representational drift
but interestingly does not improve performance on a compositional
discrimination task. Together, our findings emphasise the key role
representational alignment plays in simulations of language emergence.
comment: Appeared at the 13th edition of the Workshop on Cognitive Modeling
and Computational Linguistics (CMCL 2024)
☆ Positive Text Reframing under Multi-strategy Optimization
Differing from sentiment transfer, positive reframing seeks to substitute
negative perspectives with positive expressions while preserving the original
meaning. With the emergence of pre-trained language models (PLMs), it is
possible to achieve acceptable results by fine-tuning PLMs. Nevertheless,
generating fluent, diverse and task-constrained reframing text remains a
significant challenge. To tackle this issue, a multi-strategy optimization
framework (MSOF) is proposed in this paper.
Starting from the objective of positive reframing, we first design positive
sentiment reward and content preservation reward to encourage the model to
transform the negative expressions of the original text while ensuring the
integrity and consistency of the semantics. Then, different decoding
optimization approaches are introduced to improve the quality of text
generation. Finally, based on the modeling formula of positive reframing, we
propose a multi-dimensional re-ranking method that further selects candidate
sentences from three dimensions: strategy consistency, text similarity and
fluency. Extensive experiments on two Seq2Seq PLMs, BART and T5, demonstrate
our framework achieves significant improvements on unconstrained and controlled
positive reframing tasks.
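The final re-ranking step described above can be sketched as a weighted score over the three dimensions. This is a hypothetical illustration with made-up scores and uniform weights, not the paper's scoring functions:

```python
# Hypothetical sketch of multi-dimensional re-ranking: score each candidate
# reframing on strategy consistency, text similarity, and fluency, then pick
# the candidate with the best combined score.
def rerank(candidates, weights=(1.0, 1.0, 1.0)):
    def total(c):
        scores = (c["strategy"], c["sim"], c["fluency"])
        return sum(w * s for w, s in zip(weights, scores))
    return max(candidates, key=total)

# Toy candidates; in practice the scores would come from a strategy
# classifier, a similarity metric, and a fluency (language-model) score.
cands = [
    {"text": "a", "strategy": 0.9, "sim": 0.5, "fluency": 0.7},
    {"text": "b", "strategy": 0.6, "sim": 0.9, "fluency": 0.9},
]
best = rerank(cands)
```

Adjusting the weights trades off faithfulness to the requested reframing strategy against meaning preservation and fluency.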
☆ Modelling Multimodal Integration in Human Concept Processing with Vision-and-Language Models
Representations from deep neural networks (DNNs) have proven remarkably
predictive of neural activity involved in both visual and linguistic
processing. Despite these successes, most studies to date concern unimodal
DNNs, encoding either visual or textual input but not both. Yet, there is
growing evidence that human meaning representations integrate linguistic and
sensory-motor information. Here we investigate whether the integration of
multimodal information operated by current vision-and-language DNN models
(VLMs) leads to representations that are more aligned with human brain activity
than those obtained by language-only and vision-only DNNs. We focus on fMRI
responses recorded while participants read concept words in the context of
either a full sentence or an accompanying picture. Our results reveal that VLM
representations correlate more strongly than language- and vision-only DNNs
with activations in brain areas functionally related to language processing. A
comparison between different types of visuo-linguistic architectures shows that
recent generative VLMs tend to be less brain-aligned than previous
architectures with lower performance on downstream applications. Moreover,
through an additional analysis comparing brain vs. behavioural alignment across
multiple VLMs, we show that -- with one remarkable exception -- representations
that strongly align with behavioural judgments do not correlate highly with
brain responses. This indicates that brain similarity does not go hand in hand
with behavioural similarity, and vice versa.
☆ The Power of Combining Data and Knowledge: GPT-4o is an Effective Interpreter of Machine Learning Models in Predicting Lymph Node Metastasis of Lung Cancer
Lymph node metastasis (LNM) is a crucial factor in determining the initial
treatment for patients with lung cancer, yet accurate preoperative diagnosis of
LNM remains challenging. Recently, large language models (LLMs) have garnered
significant attention due to their remarkable text generation capabilities.
Leveraging the extensive medical knowledge learned from vast corpora, LLMs can
estimate probabilities for clinical problems, though their performance has
historically been inferior to data-driven machine learning models. In this
paper, we propose a novel ensemble method that combines the medical knowledge
acquired by LLMs with the latent patterns identified by machine learning models
to enhance LNM prediction performance. Initially, we developed machine learning
models using patient data. We then designed a prompt template to integrate the
patient data with the predicted probability from the machine learning model.
Subsequently, we instructed GPT-4o, the most advanced LLM developed by OpenAI,
to estimate the likelihood of LNM based on patient data and then adjust the
estimate using the machine learning output. Finally, we collected three outputs
from the GPT-4o using the same prompt and ensembled these results as the final
prediction. Using the proposed method, our models achieved an AUC value of
0.765 and an AP value of 0.415 for LNM prediction, significantly improving
predictive performance compared to baseline machine learning models. The
experimental results indicate that GPT-4o can effectively leverage its medical
knowledge and the probabilities predicted by machine learning models to achieve
more accurate LNM predictions. These findings demonstrate that LLMs can perform
well in clinical risk prediction tasks, offering a new paradigm for integrating
medical knowledge and patient data in clinical predictions.
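The described pipeline, formatting patient data and the ML model's probability into a prompt, sampling the LLM several times, and ensembling, can be sketched as follows. The prompt template and the stub "LLM" are invented for illustration; a real system would call GPT-4o through an API:

```python
from statistics import mean

# Hypothetical sketch of the knowledge + data ensemble: embed the patient data
# and the ML model's predicted probability in one prompt, query the LLM three
# times, and average the returned probabilities as the final prediction.
PROMPT = (
    "Patient data: {data}\n"
    "A machine learning model estimates P(LNM) = {p_ml:.2f}.\n"
    "Considering both, estimate the probability of lymph node metastasis."
)

def predict_lnm(ask_llm, patient_data, p_ml, n_samples=3):
    prompt = PROMPT.format(data=patient_data, p_ml=p_ml)
    return mean(ask_llm(prompt) for _ in range(n_samples))

# Stub LLM that returns a fixed adjusted probability; the real model would
# weigh its medical knowledge against the ML estimate in the prompt.
p = predict_lnm(lambda prompt: 0.4, "55y, 2.1 cm nodule", 0.30)
```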
☆ A Large-Scale Sensitivity Analysis on Latent Embeddings and Dimensionality Reductions for Text Spatializations IEEE VIS 2024
The semantic similarity between documents of a text corpus can be visualized
using map-like metaphors based on two-dimensional scatterplot layouts. These
layouts result from a dimensionality reduction on the document-term matrix or a
representation within a latent embedding, including topic models. Thereby, the
resulting layout depends on the input data and hyperparameters of the
dimensionality reduction and is therefore affected by changes in them. However,
such changes to the layout require additional cognitive efforts from the user.
present a sensitivity study that analyzes the stability of these layouts
concerning (1) changes in the text corpora, (2) changes in the hyperparameters,
and (3) randomness in the initialization. Our approach has two stages: data
measurement and data analysis. First, we derived layouts for the combination of
three text corpora and six text embeddings and a grid-search-inspired
hyperparameter selection of the dimensionality reductions. Afterward, we
quantified the similarity of the layouts through ten metrics, concerning local
and global structures and class separation. Second, we analyzed the resulting
42817 tabular data points in a descriptive statistical analysis. From this, we
derived guidelines for informed decisions on the layout algorithm and highlight
specific hyperparameter settings. We provide our implementation as a Git
repository at
https://github.com/hpicgs/Topic-Models-and-Dimensionality-Reduction-Sensitivity-Study
and results as Zenodo archive at https://doi.org/10.5281/zenodo.12772898.
comment: To be published at IEEE VIS 2024 conference
☆ Improving Domain-Specific ASR with LLM-Generated Contextual Descriptions INTERSPEECH 2024
End-to-end automatic speech recognition (E2E ASR) systems have significantly
improved speech recognition through training on extensive datasets. Despite
these advancements, they still struggle to accurately recognize domain specific
words, such as proper nouns and technical terminologies. To address this
problem, we propose a method to utilize the state-of-the-art Whisper without
modifying its architecture, preserving its generalization performance while
enabling it to leverage descriptions effectively. Moreover, we propose two
additional training techniques to improve domain-specific ASR: decoder
fine-tuning, and context perturbation. We also propose a method to use a Large
Language Model (LLM) to generate descriptions with simple metadata, when
descriptions are unavailable. Our experiments demonstrate that proposed methods
notably enhance domain-specific ASR accuracy on real-life datasets, with
LLM-generated descriptions outperforming human-crafted ones in effectiveness.
comment: Accepted to INTERSPEECH 2024
☆ Is the Digital Forensics and Incident Response Pipeline Ready for Text-Based Threats in LLM Era?
In the era of generative AI, the widespread adoption of Neural Text
Generators (NTGs) presents new cybersecurity challenges, particularly within
the realms of Digital Forensics and Incident Response (DFIR). These challenges
primarily involve the detection and attribution of sources behind advanced
attacks like spearphishing and disinformation campaigns. As NTGs evolve, the
task of distinguishing between human and NTG-authored texts becomes critically
complex. This paper rigorously evaluates the DFIR pipeline tailored for
text-based security systems, specifically focusing on the challenges of
detecting and attributing authorship of NTG-authored texts. By introducing a
novel human-NTG co-authorship text attack, termed CS-ACT, our study uncovers
significant vulnerabilities in traditional DFIR methodologies, highlighting
discrepancies between ideal scenarios and real-world conditions. Utilizing 14
diverse datasets and 43 unique NTGs, up to the latest GPT-4, our research
identifies substantial vulnerabilities in the forensic profiling phase,
particularly in attributing authorship to NTGs. Our comprehensive evaluation
points to factors such as model sophistication and the lack of distinctive
style within NTGs as significant contributors to these vulnerabilities. Our
findings underscore the necessity for more sophisticated and adaptable
strategies, such as incorporating adversarial learning, stylizing NTGs, and
implementing hierarchical attribution through the mapping of NTG lineages to
enhance source attribution. This sets the stage for future research and the
development of more resilient text-based security systems.
comment: This work has been submitted to the IEEE for possible publication.
Copyright may be transferred without notice, after which this version may no
longer be accessible
☆ Financial Statement Analysis with Large Language Models
We investigate whether an LLM can successfully perform financial statement
analysis in a way similar to a professional human analyst. We provide
standardized and anonymous financial statements to GPT-4 and instruct the model
to analyze them to determine the direction of future earnings. Even without any
narrative or industry-specific information, the LLM outperforms financial
analysts in its ability to predict earnings changes. The LLM exhibits a
relative advantage over human analysts in situations when the analysts tend to
struggle. Furthermore, we find that the prediction accuracy of the LLM is on
par with the performance of a narrowly trained state-of-the-art ML model. The
LLM's predictions do not stem from memorized training data. Instead, we find that the
LLM generates useful narrative insights about a company's future performance.
Lastly, our trading strategies based on GPT's predictions yield a higher Sharpe
ratio and alphas than strategies based on other models. Taken together, our
results suggest that LLMs may take a central role in decision-making.
comment: Previously posted on SSRN (May 21, 2024). See
http://papers.ssrn.com/sol3/papers.cfm?abstract_id=4835311
☆ factgenie: A Framework for Span-based Evaluation of Generated Texts
We present factgenie: a framework for annotating and visualizing word spans
in textual model outputs. Annotations can capture various span-based phenomena
such as semantic inaccuracies or irrelevant text. With factgenie, the
annotations can be collected both from human crowdworkers and large language
models. Our framework consists of a web interface for data visualization and
gathering text annotations, powered by an easily extensible codebase.
comment: Accepted to INLG 2024 (System Demonstrations)
☆ Exploring Description-Augmented Dataless Intent Classification ACL 2024
In this work, we introduce several schemes to leverage description-augmented
embedding similarity for dataless intent classification using current
state-of-the-art (SOTA) text embedding models. We report results of our methods
on four commonly used intent classification datasets and compare against
previous works of a similar nature. Our work shows promising results for
dataless classification scaling to a large number of unseen intents. We show
competitive results and significant improvements (+6.12\% Avg.) over strong
zero-shot baselines, all without training on labelled or task-specific data.
Furthermore, we provide qualitative error analysis of the shortfalls of this
methodology to help guide future research in this area.
comment: Accepted to the 6th NLP for Conversational AI Workshop at ACL
2024(NLP4ConvAI)
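The core mechanism is simple enough to sketch: embed the utterance and each intent's natural-language description, then pick the nearest description by cosine similarity. A minimal illustration with plain lists standing in for the SOTA embedding models the paper actually uses (the function and intent names are hypothetical):

```python
def classify_intent(utterance_vec, intent_desc_vecs):
    """Dataless intent classification via embedding similarity:
    return the intent whose description embedding is most
    cosine-similar to the utterance embedding."""
    def cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        norm = lambda v: sum(x * x for x in v) ** 0.5
        return dot / (norm(a) * norm(b))
    return max(intent_desc_vecs,
               key=lambda name: cos(utterance_vec, intent_desc_vecs[name]))
```

In the paper's setting, the vectors would come from a state-of-the-art text-embedding model applied to the utterance and to each intent's (description-augmented) text; no labelled or task-specific training data is involved.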
☆ Shapley Value-based Contrastive Alignment for Multimodal Information Extraction
The rise of social media and the exponential growth of multimodal
communication necessitates advanced techniques for Multimodal Information
Extraction (MIE). However, existing methodologies primarily rely on direct
Image-Text interactions, a paradigm that often faces significant challenges due
to semantic and modality gaps between images and text. In this paper, we
introduce a new paradigm of Image-Context-Text interaction, where large
multimodal models (LMMs) are utilized to generate descriptive textual context
to bridge these gaps. In line with this paradigm, we propose a novel Shapley
Value-based Contrastive Alignment (Shap-CA) method, which aligns both
context-text and context-image pairs. Shap-CA initially applies the Shapley
value concept from cooperative game theory to assess the individual
contribution of each element in the set of contexts, texts and images towards
total semantic and modality overlaps. Following this quantitative evaluation, a
contrastive learning strategy is employed to enhance the interactive
contribution within context-text/image pairs, while minimizing the influence
across these pairs. Furthermore, we design an adaptive fusion module for
selective cross-modal fusion. Extensive experiments across four MIE datasets
demonstrate that our method significantly outperforms existing state-of-the-art
methods.
comment: Accepted at ACM Multimedia 2024
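The Shapley-value scoring at the heart of Shap-CA comes from cooperative game theory: each element's contribution is its average marginal gain over all orderings of the set. A minimal exact computation (the value function is a hypothetical stand-in for the paper's semantic/modality-overlap measure, and exact enumeration is only tractable for small sets):

```python
from itertools import permutations
from math import factorial

def shapley_values(players, value_fn):
    """Exact Shapley values: average each player's marginal
    contribution over all orderings of the player set."""
    phi = {p: 0.0 for p in players}
    for order in permutations(players):
        coalition = set()
        prev = value_fn(frozenset(coalition))
        for p in order:
            coalition.add(p)
            cur = value_fn(frozenset(coalition))
            phi[p] += cur - prev  # marginal contribution of p
            prev = cur
    n_orders = factorial(len(players))
    return {p: total / n_orders for p, total in phi.items()}
```

Shap-CA uses such per-element contribution scores to weight a contrastive objective over context-text and context-image pairs; the game-theoretic step itself is model-agnostic.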
☆ Scaling A Simple Approach to Zero-Shot Speech Recognition
Despite rapid progress in increasing the language coverage of automatic
speech recognition, the field is still far from covering all languages with a
known writing script. Recent work showed promising results with a zero-shot
approach requiring only a small amount of text data, however, accuracy heavily
depends on the quality of the used phonemizer which is often weak for unseen
languages. In this paper, we present MMS Zero-shot a conceptually simpler
approach based on romanization and an acoustic model trained on data in 1,078
different languages, or three orders of magnitude more than prior art. MMS
Zero-shot reduces the average character error rate by a relative 46% over 100
unseen languages compared to the best previous work. Moreover, the error rate
of our approach is only 2.5x higher compared to in-domain supervised baselines,
while our approach uses no labeled data for the evaluation languages at all.
comment: 9 pages
☆ Innovative Speech-Based Deep Learning Approaches for Parkinson's Disease Classification: A Systematic Review
Parkinson's disease (PD), the second most prevalent neurodegenerative
disorder worldwide, frequently presents with early-stage speech impairments.
Recent advancements in Artificial Intelligence (AI), particularly deep learning
(DL), have significantly enhanced PD diagnosis through the analysis of speech
data. Nevertheless, the progress of research is restricted by the limited
availability of publicly accessible speech-based PD datasets, primarily due to
privacy and ethical concerns. This review covers the latest DL-based AI
approaches for speech-based PD classification, focusing on performance,
available resources and associated challenges of 33 scientific works published
between 2020 and March 2024. These DL approaches are categorized into
end-to-end (E2E) learning, transfer learning (TL) and deep acoustic features
(DAF) extraction. Among E2E approaches, Convolutional Neural Networks (CNNs)
are prevalent, though Transformers are increasingly popular. E2E approaches
face challenges such as limited data and computational resources, especially
with Transformers. TL addresses these issues by providing more robust PD
diagnosis and better generalizability across languages. DAF extraction aims to
improve the explainability and interpretability of results by examining the
specific effects of deep features on both other DL approaches and more
traditional machine learning (ML) methods. However, it often underperforms
compared to E2E and TL approaches. This review also discusses unresolved issues
related to bias, explainability and privacy, highlighting the need for future
research.
comment: Submitted to Applied Sciences - a peer-reviewed Open Access journal.
This research was funded by the NWO research programme AiNed Fellowship
Grants under the project Responsible AI for Voice Diagnostics (RAIVD) - grant
number NGF.1607.22.013
☆ Unified Lexical Representation for Interpretable Visual-Language Alignment
Visual-Language Alignment (VLA) has gained a lot of attention since CLIP's
groundbreaking work. Although CLIP performs well, the typical direct latent
feature alignment lacks clarity in its representation and similarity scores. On
the other hand, lexical representation, a vector whose element represents the
similarity between the sample and a word from the vocabulary, is a natural
sparse representation and interpretable, providing exact matches for individual
words. However, lexical representations are difficult to learn due to the lack
of ground-truth supervision and false-discovery issues, and thus require complex
design to train effectively. In this paper, we introduce LexVLA, a more
interpretable VLA framework by learning a unified lexical representation for
both modalities without complex design. We use DINOv2 as our visual model for
its local-inclined features and Llama 2, a generative language model, to
leverage its in-context lexical prediction ability. To avoid false discovery,
we propose an overuse penalty that prevents the lexical representation from
falsely activating meaningless words. We demonstrate that these
two pre-trained uni-modal models can be well-aligned by fine-tuning on a modest
multi-modal dataset, avoiding intricate training configurations. On cross-modal
retrieval benchmarks, LexVLA, trained on the CC-12M multi-modal dataset,
outperforms baselines fine-tuned on larger datasets (e.g., YFCC15M) and those
trained from scratch on even bigger datasets (e.g., 1.1B data, including
CC-12M). We conduct extensive experiments to analyze LexVLA.
☆ Demystifying Verbatim Memorization in Large Language Models
Large Language Models (LLMs) frequently memorize long sequences verbatim,
often with serious legal and privacy implications. Much prior work has studied
such verbatim memorization using observational data. To complement such work,
we develop a framework to study verbatim memorization in a controlled setting
by continuing pre-training from Pythia checkpoints with injected sequences. We
find that (1) non-trivial amounts of repetition are necessary for verbatim
memorization to happen; (2) later (and presumably better) checkpoints are more
likely to verbatim memorize sequences, even for out-of-distribution sequences;
(3) the generation of memorized sequences is triggered by distributed model
states that encode high-level features and makes important use of general
language modeling capabilities. Guided by these insights, we develop stress
tests to evaluate unlearning methods and find they often fail to remove the
verbatim memorized information, while also degrading the LM. Overall, these
findings challenge the hypothesis that verbatim memorization stems from
specific model weights or mechanisms. Rather, verbatim memorization is
intertwined with the LM's general capabilities and thus will be very difficult
to isolate and suppress without degrading model quality.
☆ KiVA: Kid-inspired Visual Analogies for Testing Large Multimodal Models
Eunice Yiu, Maan Qraitem, Charlie Wong, Anisa Noor Majhi, Yutong Bai, Shiry Ginosar, Alison Gopnik, Kate Saenko
This paper investigates visual analogical reasoning in large multimodal
models (LMMs) compared to human adults and children. A "visual analogy" is an
abstract rule inferred from one image and applied to another. While benchmarks
exist for testing visual reasoning in LMMs, they require advanced skills and
omit basic visual analogies that even young children can make. Inspired by
developmental psychology, we propose a new benchmark of 1,400 visual
transformations of everyday objects to test LMMs on visual analogical reasoning
and compare them to children and adults. We structure the evaluation into three
stages: identifying what changed (e.g., color, number, etc.), how it changed
(e.g., added one object), and applying the rule to new scenarios. Our findings
show that while models like GPT-4V, LLaVA-1.5, and MANTIS identify the "what"
effectively, they struggle with quantifying the "how" and extrapolating this
rule to new objects. In contrast, children and adults exhibit much stronger
analogical reasoning at all three stages. Additionally, the strongest tested
model, GPT-4V, performs better in tasks involving simple visual attributes like
color and size, correlating with quicker human adult response times.
Conversely, more complex tasks such as number, rotation, and reflection, which
necessitate extensive cognitive processing and understanding of the 3D physical
world, present more significant challenges. Altogether, these findings
highlight the limitations of training models on data that primarily consists of
2D images and text.
comment: 9 pages. For the KiVA benchmark, see https://github.com/ey242/KiVA
☆ ERIT Lightweight Multimodal Dataset for Elderly Emotion Recognition and Multimodal Fusion Evaluation
ERIT is a novel multimodal dataset designed to facilitate research in
lightweight multimodal fusion. It contains text and image data collected from
videos of elderly individuals reacting to various situations, as well as seven
emotion labels for each data sample. Because it uses labeled images of
elderly users reacting emotionally, it also facilitates research on emotion
recognition in an age group underrepresented in machine-learning-based visual
emotion recognition. The dataset is validated through comprehensive experiments
indicating its importance in neural multimodal fusion research.
☆ Banyan: Improved Representation Learning with Explicit Structure
We present Banyan, an improved model to learn semantic representations by
inducing explicit structure over data. In contrast to prior approaches using
structure spanning single sentences, Banyan learns by resolving multiple
constituent structures into a shared one explicitly incorporating global
context. Combined with an improved message-passing scheme inspired by Griffin,
Banyan learns significantly better representations, avoids spurious false
negatives with contrastive learning, and drastically improves memory efficiency
in such explicit-structured models. Using the Self-StrAE framework, we show
that Banyan (a) outperforms baselines using sentential structure across various
settings, (b) matches or outperforms unstructured baselines like GloVe
(+augmentations) and a RoBERTa medium (+simcse) pre-trained on 100M tokens,
despite having just a handful of (non-embedding) parameters, and (c) also
learns effective representations across several low resource (Asian and
African) languages as measured on SemRel tasks.
comment: First Draft
☆ BotEval: Facilitating Interactive Human Evaluation ACL 2024
Following the rapid progress in natural language processing (NLP) models,
language models are applied to increasingly more complex interactive tasks such
as negotiation and conversation moderation. Having human evaluators directly
interact with these NLP models is essential for adequately evaluating the
performance on such interactive tasks. We develop BotEval, an easily
customizable, open-source, evaluation toolkit that focuses on enabling
human-bot interactions as part of the evaluation process, as opposed to human
evaluators making judgements for a static input. BotEval balances flexibility
for customization and user-friendliness by providing templates for common use
cases that span various degrees of complexity and built-in compatibility with
popular crowdsourcing platforms. We showcase the numerous useful features of
BotEval through a study that evaluates the performance of various chatbots on
their effectiveness for conversational moderation and discuss how BotEval
differs from other annotation tools.
comment: ACL 2024 SDT, 10 pages
☆ Beyond Entity Alignment: Towards Complete Knowledge Graph Alignment via Entity-Relation Synergy
Knowledge Graph Alignment (KGA) aims to integrate knowledge from multiple
sources to address the limitations of individual Knowledge Graphs (KGs) in
terms of coverage and depth. However, current KGA models fall short in
achieving a ``complete'' knowledge graph alignment. Existing models primarily
emphasize the linkage of cross-graph entities but overlook aligning relations
across KGs, thereby providing only a partial solution to KGA. The semantic
correlations embedded in relations are largely overlooked, potentially
restricting a comprehensive understanding of cross-KG signals. In this paper,
we propose to conceptualize relation alignment as an independent task and
conduct KGA by decomposing it into two distinct but highly correlated
sub-tasks: entity alignment and relation alignment. To capture the mutually
reinforcing correlations between these objectives, we propose a novel
Expectation-Maximization-based model, EREM, which iteratively optimizes both
sub-tasks. Experimental results on real-world datasets demonstrate that EREM
consistently outperforms state-of-the-art models in both entity alignment and
relation alignment tasks.
☆ Cost-effective Instruction Learning for Pathology Vision and Language Analysis
Kaitao Chen, Mianxin Liu, Fang Yan, Lei Ma, Xiaoming Shi, Lilong Wang, Xiaosong Wang, Lifeng Zhu, Zhe Wang, Mu Zhou, Shaoting Zhang
The advent of vision-language models fosters interactive conversations
between AI-enabled models and humans. Yet applying these models in clinics
requires dealing with daunting challenges around large-scale training data and
financial and computational resources. Here we propose a cost-effective
instruction learning framework for conversational pathology named CLOVER. CLOVER only
trains a lightweight module and uses instruction tuning while freezing the
parameters of the large language model. Instead of using costly GPT-4, we
propose well-designed prompts on GPT-3.5 for building generation-based
instructions, emphasizing the utility of pathological knowledge derived from
the Internet source. To augment the use of instructions, we construct a
high-quality set of template-based instructions in the context of digital
pathology. From two benchmark datasets, our findings reveal the strength of
hybrid-form instructions in visual question answering in pathology. Extensive
results show the cost-effectiveness of CLOVER in answering both open-ended and
closed-ended questions, where CLOVER outperforms strong baselines that possess
37 times more training parameters and use instruction data generated from
GPT-4. Through instruction tuning, CLOVER exhibits robust few-shot learning
on an external clinical dataset. These findings demonstrate that
cost-effective modeling of CLOVER could accelerate the adoption of rapid
conversational applications in the landscape of digital pathology.
☆ Are Large Language Models Possible to Conduct Cognitive Behavioral Therapy?
Hao Shen, Zihan Li, Minqiang Yang, Minghui Ni, Yongfeng Tao, Zhengyang Yu, Weihao Zheng, Chen Xu, Bin Hu
In contemporary society, the issue of psychological health has become
increasingly prominent, characterized by the diversification, complexity, and
universality of mental disorders. Cognitive Behavioral Therapy (CBT), currently
the most influential and clinically effective psychological treatment method
with no side effects, has limited coverage and poor quality in most countries.
In recent years, research on the recognition of and intervention in emotional
disorders using large language models (LLMs) has been validated, providing new
possibilities for psychological assistance therapy. However, are LLMs truly
capable of conducting cognitive behavioral therapy? Many concerns have been
raised by mental health experts regarding the use of LLMs for therapy. Seeking
to answer this question, we collected a real CBT corpus from online video
websites and designed and conducted a targeted automatic evaluation framework
covering the emotional tendency of generated text, structured dialogue
patterns, and proactive inquiry ability. For emotional tendency, we calculate
the emotional tendency score of the CBT dialogue text generated by each model.
For structured dialogue patterns, we use a diverse range of automatic
evaluation metrics to compare speaking style, the ability to maintain topic
consistency, and the use of CBT techniques across different models. For
inquiry that guides the patient, we use a PQA (Proactive Questioning Ability)
metric. We also evaluate the CBT ability of each LLM after integrating a CBT
knowledge base, exploring whether introducing additional knowledge enhances
the model's CBT counseling ability. Four LLM variants with excellent natural
language processing performance are evaluated, and the experimental results
show the great potential of LLMs in the psychological counseling realm,
especially when combined with other technological means.
☆ Describe Where You Are: Improving Noise-Robustness for Speech Emotion Recognition with Text Description of the Environment
Speech emotion recognition (SER) systems often struggle in real-world
environments, where ambient noise severely degrades their performance. This
paper explores a novel approach that exploits prior knowledge of testing
environments to maximize SER performance under noisy conditions. To address
this task, we propose a text-guided, environment-aware training where an SER
model is trained with contaminated speech samples and their paired noise
description. We use a pre-trained text encoder to extract the text-based
environment embedding and then fuse it to a transformer-based SER model during
training and inference. We demonstrate the effectiveness of our approach
through our experiment with the MSP-Podcast corpus and real-world additive
noise samples collected from the Freesound repository. Our experiment indicates
that the text-based environment descriptions processed by a large language
model (LLM) produce representations that improve the noise-robustness of the
SER system. In addition, our proposed approach with an LLM yields better
performance than our environment-agnostic baselines, especially in low
signal-to-noise ratio (SNR) conditions. When testing at -5dB SNR level, our
proposed method shows better performance than our best baseline model by 31.8%
(arousal), 23.5% (dominance), and 9.5% (valence).
☆ Enhancing Agent Learning through World Dynamics Modeling
While large language models (LLMs) have been increasingly deployed across
tasks in language understanding and interactive decision-making, their
impressive performance is largely due to the comprehensive and in-depth domain
knowledge embedded within them. However, the extent of this knowledge can vary
across different domains. Existing methods often assume that LLMs already
possess such comprehensive and in-depth knowledge of their environment,
overlooking potential gaps in their understanding of actual world dynamics. To
address this gap, we introduce Discover, Verify, and Evolve (DiVE), a framework
that discovers world dynamics from a small number of demonstrations, verifies
the correctness of these dynamics, and evolves new, advanced dynamics tailored
to the current situation. Through extensive evaluations, we analyze the impact
of each component on performance and compare the automatically generated
dynamics from DiVE with human-annotated world dynamics. Our results demonstrate
that LLMs guided by DiVE can make better decisions, achieving rewards
comparable to human players in the Crafter environment.
☆ Examining the Influence of Political Bias on Large Language Model Performance in Stance Classification
Large Language Models (LLMs) have demonstrated remarkable capabilities in
executing tasks based on natural language queries. However, these models,
trained on curated datasets, inherently embody biases ranging from racial to
national and gender biases. It remains uncertain whether these biases impact
the performance of LLMs for certain tasks. In this study, we investigate the
political biases of LLMs within the stance classification task, specifically
examining whether these models exhibit a tendency to more accurately classify
politically-charged stances. Utilizing three datasets, seven LLMs, and four
distinct prompting schemes, we analyze the performance of LLMs on politically
oriented statements and targets. Our findings reveal a statistically
significant difference in the performance of LLMs across various politically
oriented stance classification tasks. Furthermore, we observe that this
difference primarily manifests at the dataset level, with models and prompting
schemes showing statistically similar performances across different stance
classification datasets. Lastly, we observe that when there is greater
ambiguity in the target the statement is directed towards, LLMs have poorer
stance classification accuracy.
comment: Accepted at ICWSM 2025
☆ Transformers on Markov Data: Constant Depth Suffices
Attention-based transformers have been remarkably successful at modeling
generative processes across various domains and modalities. In this paper, we
study the behavior of transformers on data drawn from $k$-th order Markov processes,
where the conditional distribution of the next symbol in a sequence depends on
the previous $k$ symbols observed. We observe a surprising phenomenon
empirically which contradicts previous findings: when trained for sufficiently
long, a transformer with a fixed depth and $1$ head per layer is able to
achieve low test loss on sequences drawn from $k$-th order Markov sources, even as $k$
grows. Furthermore, this low test loss is achieved by the transformer's ability
to represent and learn the in-context conditional empirical distribution. On
the theoretical side, our main result is that a transformer with a single head
and three layers can represent the in-context conditional empirical
distribution for $k$-th order Markov sources, concurring with our empirical
observations. Along the way, we prove that \textit{attention-only} transformers
with $O(\log_2(k))$ layers can represent the in-context conditional empirical
distribution by composing induction heads to track the previous $k$ symbols in
the sequence. These results provide more insight into our current understanding
of the mechanisms by which transformers learn to capture context, by
understanding their behavior on Markov sources.
comment: 29 pages, 10 figures
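The in-context conditional empirical distribution the abstract refers to can be written down directly: for each length-$k$ context appearing in the sequence, count which symbols follow it. A small sketch (our notation, not the authors' code):

```python
from collections import Counter, defaultdict

def conditional_empirical(seq, k):
    """Estimate P(next symbol | previous k symbols) from counts
    within the sequence itself, i.e. the in-context conditional
    empirical distribution for a k-th order Markov source."""
    counts = defaultdict(Counter)
    for i in range(k, len(seq)):
        ctx = tuple(seq[i - k:i])
        counts[ctx][seq[i]] += 1
    return {ctx: {s: c / sum(ctr.values()) for s, c in ctr.items()}
            for ctx, ctr in counts.items()}
```

The paper's theoretical results concern transformers *representing* this estimator internally (e.g. via composed induction heads that track the previous $k$ symbols); the function above only states what is being represented.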
☆ Efficient LLM Training and Serving with Heterogeneous Context Sharding among Attention Heads
Existing LLM training and inference frameworks struggle in boosting
efficiency with sparsity while maintaining the integrity of context and model
architecture. Inspired by the sharding concept in database and the fact that
attention parallelizes over heads on accelerators, we propose Sparsely-Sharded
(S2) Attention, an attention algorithm that allocates heterogeneous context
partitions for different attention heads to divide and conquer. S2-Attention
enforces each attention head to only attend to a partition of contexts
following a strided sparsity pattern, while the full context is preserved as
the union of all the shards. As attention heads are processed in separate
thread blocks, the context reduction for each head can thus produce end-to-end
speed-up and memory reduction. At inference, LLMs trained with S2-Attention can
then take the KV cache reduction for free, with guaranteed preservation of
model quality. In experiments, we show S2-Attention can provide as much as (1)
25.3X wall-clock attention speed-up over FlashAttention-2, resulting in a 6X
reduction in end-to-end training time and 10X lower inference latency, (2)
on-par model training quality compared to default attention, and (3) perfect
needle-retrieval accuracy over a 32K context window. On top of the algorithm,
we build DKernel, an LLM training and inference kernel library that allows
users to customize sparsity patterns for their own models. We have
open-sourced DKernel and made it compatible with Megatron, PyTorch, and vLLM.
comment: 10 pages
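The head-wise context partitioning can be illustrated with a toy strided pattern: each head attends only to its own stride of positions, and the union of the heads' shards recovers the full context. This is only an illustrative guess at the flavor of the scheme; the actual S2-Attention kernels are more elaborate:

```python
def strided_shards(seq_len, num_heads):
    """Toy strided context sharding: head h is assigned positions p
    with p % num_heads == h. Shards are disjoint and their union
    covers the full context, mirroring the property the abstract
    describes."""
    return [[p for p in range(seq_len) if p % num_heads == h]
            for h in range(num_heads)]
```

Because each head touches only a fraction of the context, its KV cache and attention compute shrink proportionally, while the full context remains represented across the ensemble of heads.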
♻ ☆ Block Verification Accelerates Speculative Decoding
Ziteng Sun, Uri Mendlovic, Yaniv Leviathan, Asaf Aharoni, Ahmad Beirami, Jae Hun Ro, Ananda Theertha Suresh
Speculative decoding is an effective method for lossless acceleration of
large language models during inference. It uses a fast model to draft a block
of tokens which are then verified in parallel by the target model, and provides
a guarantee that the output is distributed identically to a sample from the
target model. In prior works, draft verification is performed independently
token-by-token. Surprisingly, we show that this approach is not optimal. We
propose Block Verification, a simple draft verification algorithm that verifies
the entire block jointly and provides additional wall-clock speedup. We prove
that the proposed mechanism is optimal in the expected number of tokens
produced each iteration and specifically is never worse than the standard
token-level verification. Empirically, block verification provides modest but
consistent wall-clock speedups over the standard token verification algorithm
of 5%-8% in a range of tasks and datasets. Given that block verification does
not increase code complexity, maintains the strong lossless guarantee of the
standard speculative decoding verification algorithm, cannot deteriorate
performance, and, in fact, consistently improves it, it can be used as a good
default in speculative decoding implementations.
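For reference, the token-by-token baseline that Block Verification improves on accepts each draft token x with probability min(1, p_target(x)/p_draft(x)) and stops at the first rejection. A toy sketch of that baseline (the rejection-resampling step that keeps the output distribution exactly equal to the target model's is omitted here):

```python
import random

def token_level_verify(draft_tokens, p_target, p_draft):
    """Standard token-level speculative-decoding verification:
    walk the drafted block left to right, accepting token x with
    prob min(1, p_target[x] / p_draft[x]); stop at first rejection.
    Block Verification instead scores the block jointly."""
    accepted = []
    for x in draft_tokens:
        if random.random() < min(1.0, p_target[x] / p_draft[x]):
            accepted.append(x)
        else:
            break
    return accepted
```

The paper's observation is that deciding each token independently is suboptimal in expected accepted length; verifying the whole block jointly never does worse and yields the reported 5%-8% wall-clock gains.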
♻ ☆ ShiftAddLLM: Accelerating Pretrained LLMs via Post-Training Multiplication-Less Reparameterization
Haoran You, Yipin Guo, Yichao Fu, Wei Zhou, Huihong Shi, Xiaofan Zhang, Souvik Kundu, Amir Yazdanbakhsh, Yingyan Celine Lin
Large language models (LLMs) have shown impressive performance on language
tasks but face challenges when deployed on resource-constrained devices due to
their extensive parameters and reliance on dense multiplications, resulting in
high memory demands and latency bottlenecks. Shift-and-add reparameterization
offers a promising solution by replacing costly multiplications with
hardware-friendly primitives in both the attention and multi-layer perceptron
(MLP) layers of an LLM. However, current reparameterization techniques require
training from scratch or full parameter fine-tuning to restore accuracy, which
is resource-intensive for LLMs. To address this, we propose accelerating
pretrained LLMs through post-training shift-and-add reparameterization,
creating efficient multiplication-free models, dubbed ShiftAddLLM.
Specifically, we quantize each weight matrix into binary matrices paired with
group-wise scaling factors. The associated multiplications are reparameterized
into (1) shifts between activations and scaling factors and (2) queries and
adds according to the binary matrices. To reduce accuracy loss, we present a
multi-objective optimization method to minimize both weight and output
activation reparameterization errors. Additionally, based on varying
sensitivity across layers to reparameterization, we develop an automated bit
allocation strategy to further reduce memory usage and latency. Experiments on
five LLM families and eight tasks consistently validate the effectiveness of
ShiftAddLLM, achieving average perplexity improvements of 5.6 and 22.7 points
at comparable or lower latency compared to the most competitive quantized LLMs
at 3 and 2 bits, respectively, and more than 80% memory and energy reductions
over the original LLMs. Codes and models are available at
https://github.com/GATECH-EIC/ShiftAddLLM.
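The quantization step described above, binary sign matrices plus group-wise scaling factors, can be sketched in a few lines; using the group's mean absolute value as the scale is our illustrative choice, not necessarily the paper's:

```python
def binary_quantize(weights, group_size):
    """Quantize a flat weight list into sign (+/-1) vectors plus one
    shared scaling factor per group (here: mean absolute value)."""
    signs, scales = [], []
    for i in range(0, len(weights), group_size):
        group = weights[i:i + group_size]
        scales.append(sum(abs(w) for w in group) / len(group))
        signs.append([1 if w >= 0 else -1 for w in group])
    return signs, scales

def dequantize(signs, scales):
    """Reconstruct approximate weights from signs and group scales."""
    return [s * scale for sgn, scale in zip(signs, scales) for s in sgn]
```

The multiplication-free payoff comes afterwards: applying a sign matrix reduces to adds and subtracts, and (when scales are constrained to powers of two) scaling an activation becomes a bit shift. The sketch shows only the quantization bookkeeping, not those hardware-level kernels.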
♻ ☆ When Linear Attention Meets Autoregressive Decoding: Towards More Effective and Efficient Linearized Large Language Models ICML 2024
Autoregressive Large Language Models (LLMs) have achieved impressive
performance in language tasks but face two significant bottlenecks: (1)
quadratic complexity in the attention module as the number of tokens increases,
and (2) limited efficiency due to the sequential processing nature of
autoregressive LLMs during generation. While linear attention and speculative
decoding offer potential solutions, their applicability and synergistic
potential for enhancing autoregressive LLMs remain uncertain. We conduct the
first comprehensive study on the efficacy of existing linear attention methods
for autoregressive LLMs, integrating them with speculative decoding. We
introduce an augmentation technique for linear attention that ensures
compatibility with speculative decoding, enabling more efficient training and
serving of LLMs. Extensive experiments and ablation studies involving seven
existing linear attention models and five encoder/decoder-based LLMs
consistently validate the effectiveness of our augmented linearized LLMs.
Notably, our approach achieves up to a 6.67 reduction in perplexity on the
LLaMA model and up to a 2$\times$ speedup during generation compared to prior
linear attention methods. Codes and models are available at
https://github.com/GATECH-EIC/Linearized-LLM.
comment: Accepted by ICML 2024; 17 pages; 10 figures; 16 tables
♻ ☆ A Unified Framework for Model Editing ACL 2024
ROME and MEMIT are largely believed to be two different model editing
algorithms, with the major difference between them being the ability to perform
batched edits. In this paper, we unify these two algorithms under a single
conceptual umbrella, optimizing for the same goal, which we call the
preservation-memorization objective. ROME uses an equality constraint to
optimize this objective to perform one edit at a time, whereas MEMIT employs a
more flexible least-square constraint that allows for batched edits. We
generalize ROME and enable batched editing with equality constraint in the form
of EMMET - an Equality-constrained Mass Model Editing algorithm for
Transformers, a new batched memory-editing algorithm. EMMET can perform
batched-edits up to a batch-size of 10,000, with very similar performance to
MEMIT across multiple dimensions. With the introduction of EMMET, we truly
unify ROME and MEMIT and show that both algorithms are equivalent in terms of
their optimization objective, their abilities (singular and batched editing),
their model editing performance and their limitations.
comment: Under review. To appear as poster at KnowledgeableLM Workshop
co-located with ACL 2024
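The preservation-memorization objective can be illustrated on a plain linear map: preserve the map's behavior on old keys while writing new values on the edit keys. A toy numpy sketch (a soft least-squares relaxation for illustration; EMMET enforces the edit term as a hard equality constraint, and the actual algorithms operate on transformer MLP weights):

```python
import numpy as np

def batched_edit(W0, K0, Ke, Ve, lam=1e4):
    """Closed-form solution of
        min_W ||W K0 - W0 K0||^2 + lam * ||W Ke - Ve||^2
    Columns of K0 are preserved keys, columns of Ke/Ve are edit keys and
    their target values. Large lam approaches an equality constraint W Ke = Ve."""
    V0 = W0 @ K0
    A = V0 @ K0.T + lam * Ve @ Ke.T
    B = K0 @ K0.T + lam * Ke @ Ke.T
    return A @ np.linalg.inv(B)
```

With a large weight on the memorization term, the edited map reproduces the new values almost exactly while staying close to the original map on the preserved keys.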
♻ ☆ Regurgitative Training: The Value of Real Data in Training Large Language Models
What happens if we train a new Large Language Model (LLM) using data that are
at least partially generated by other LLMs? The explosive success of LLMs means
that a substantial amount of content online will be generated by LLMs rather
than humans, which will inevitably enter the training datasets of
next-generation LLMs. We evaluate the implications of such "regurgitative
training" on LLM performance. Through fine-tuning GPT-3.5 with data generated
either by itself or by other LLMs in a machine translation task, we find strong
evidence that regurgitative training clearly handicaps the performance of LLMs.
The same performance loss of regurgitative training is observed on transformer
models that we train from scratch. We find suggestive evidence that the
performance disadvantage of regurgitative training can be attributed to at
least two mechanisms: (1) higher error rates and (2) lower lexical diversity in
LLM-generated data as compared to real data. Based on these mechanisms, we
propose and evaluate three different strategies to mitigate the performance
loss of regurgitative training. First, we devise data-driven metrics to gauge
the quality of each LLM-generated data instance, and then carry out an ordered
training process where high-quality data are added before low-quality ones.
Second, we combine data generated by multiple different LLMs (as an attempt to
increase lexical diversity). Third, we train an AI detection classifier to
differentiate between LLM- and human-generated data, and include LLM-generated
data in the order of resemblance to human-generated data. All three strategies
can improve the performance of regurgitative training to some extent but are
not always able to fully close the gap from training with real data. Our
results highlight the value of real, human-generated data in training LLMs,
which cannot be easily substituted by synthetic, LLM-generated data.
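The first mitigation strategy and the lexical-diversity mechanism lend themselves to short sketches: a quality-ordered curriculum and a crude diversity proxy (the `quality_score` callable and whitespace tokenization here are illustrative assumptions, not the paper's metrics):

```python
def ordered_curriculum(examples, quality_score):
    """Mitigation 1: present higher-quality synthetic instances first."""
    return sorted(examples, key=quality_score, reverse=True)

def type_token_ratio(text):
    """A simple lexical-diversity proxy (mechanism 2): distinct / total tokens."""
    tokens = text.lower().split()
    return len(set(tokens)) / max(len(tokens), 1)
```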
♻ ☆ Machine Translation Hallucination Detection for Low and High Resource Languages using Large Language Models
Kenza Benkirane, Laura Gongas, Shahar Pelles, Naomi Fuchs, Joshua Darmon, Pontus Stenetorp, David Ifeoluwa Adelani, Eduardo Sánchez
Recent advancements in massively multilingual machine translation systems
have significantly enhanced translation accuracy; however, even the best
performing systems still generate hallucinations, severely impacting user
trust. Detecting hallucinations in Machine Translation (MT) remains a critical
challenge, particularly since existing methods excel with High-Resource
Languages (HRLs) but exhibit substantial limitations when applied to
Low-Resource Languages (LRLs). This paper evaluates hallucination detection
approaches using Large Language Models (LLMs) and semantic similarity within
massively multilingual embeddings. Our study spans 16 language directions,
covering HRLs and LRLs with diverse scripts. We find that the choice of model is
essential for performance. On average, for HRLs, Llama3-70B outperforms the
previous state of the art by as much as 0.16 MCC (Matthews Correlation
Coefficient). However, for LRLs we observe that Claude Sonnet outperforms other
LLMs on average by 0.03 MCC. The key takeaway from our study is that LLMs can
achieve performance comparable or even better than previously proposed models,
despite not being explicitly trained for any machine translation task. However,
their advantage is less significant for LRLs.
comment: Authors Kenza Benkirane and Laura Gongas contributed equally to this
work
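For reference, the MCC figures quoted above are computed from binary confusion counts; a self-contained helper:

```python
import math

def mcc(tp, tn, fp, fn):
    """Matthews Correlation Coefficient from binary confusion counts.
    Ranges from -1 (total disagreement) to +1 (perfect prediction)."""
    num = tp * tn - fp * fn
    den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return num / den if den else 0.0
```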
♻ ☆ Harmonic LLMs are Trustworthy
We introduce an intuitive method to test the robustness (stability and
explainability) of any black-box LLM in real-time via its local deviation from
harmonicity, denoted as $\gamma$. To the best of our knowledge, this is the
first completely model-agnostic and unsupervised method of measuring the
robustness of any given response from an LLM, based upon the model itself
conforming to a purely mathematical standard. To show general application and
immediacy of results, we measure $\gamma$ in 10 popular LLMs (ChatGPT,
Claude-2.1, Claude3.0, GPT-4, GPT-4o, Smaug-72B, Mixtral-8x7B, Llama2-7B,
Mistral-7B and MPT-7B) across thousands of queries in three objective domains:
WebQA, ProgrammingQA, and TruthfulQA. Across all models and domains tested,
human annotation confirms that $\gamma \to 0$ indicates trustworthiness, and
conversely, searching for higher values of $\gamma$ easily exposes examples of
hallucination, a fact that enables efficient adversarial prompt generation
through stochastic gradient ascent in $\gamma$. The low-$\gamma$ leaders among
the models in the respective domains are GPT-4o, GPT-4, and Smaug-72B,
providing evidence that mid-size open-source models can win out against large
commercial models.
comment: 15 pages, 2 figures, 16 tables; added Claude-3.0, GPT-4o, Mistral-7B,
Mixtral-8x7B, and more annotation for other models
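The $\gamma$ statistic builds on the mean-value property of harmonic functions: a harmonic function equals its average over any surrounding circle, so the deviation from that average measures non-harmonicity. A 2-D toy illustration of the underlying math only (the paper applies this idea to black-box LLM responses, not to closed-form functions):

```python
import math

def gamma(f, x, y, r=0.1, n=64):
    """Deviation from the mean-value property: |f(center) - circle average|.
    Zero for harmonic functions, positive otherwise."""
    avg = sum(f(x + r * math.cos(2 * math.pi * k / n),
                y + r * math.sin(2 * math.pi * k / n)) for k in range(n)) / n
    return abs(f(x, y) - avg)
```

For f(x, y) = x^2 - y^2 (harmonic) the deviation vanishes; for f(x, y) = x^2 it equals r^2/2.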
♻ ☆ Improving Stance Detection by Leveraging Measurement Knowledge from Social Sciences: A Case Study of Dutch Political Tweets and Traditional Gender Role Division
Stance detection (SD) concerns automatically determining the viewpoint (i.e.,
in favour of, against, or neutral) of a text's author towards a target. SD has
been applied to many research topics, among which the detection of stances
behind political tweets is an important one. In this paper, we apply SD to a
dataset of tweets from official party accounts in the Netherlands between 2017
and 2021, with a focus on stances towards traditional gender role division, a
dividing issue between (some) Dutch political parties. To implement and improve
SD of traditional gender role division, we propose to leverage an established
survey instrument from social sciences, which has been validated for the
purpose of measuring attitudes towards traditional gender role division. Based
on our experiments, we show that using such a validated survey instrument helps
to improve SD performance.
♻ ☆ PATCH! Psychometrics-AssisTed benCHmarking of Large Language Models: A Case Study of Proficiency in 8th Grade Mathematics
Many existing benchmarks of large (multimodal) language models (LLMs) focus
on measuring LLMs' academic proficiency, often also with an interest in
comparing model performance with human test takers. While these benchmarks have
proven key to the development of LLMs, they suffer from several limitations,
including questionable measurement quality (e.g., Do they measure what they are
supposed to in a reliable way?), lack of quality assessment on the item level
(e.g., Are some items more important or difficult than others?) and unclear
human population reference (e.g., To whom can the model be compared?). In
response to these challenges, we propose leveraging knowledge from
psychometrics - a field dedicated to the measurement of latent variables like
academic proficiency - into LLM benchmarking. We make three primary
contributions. First, we introduce PATCH: a novel framework for
{P}sychometrics-{A}ssis{T}ed ben{CH}marking of LLMs. PATCH addresses the
aforementioned limitations, presenting a new direction for LLM benchmark
research. Second, we implement PATCH by measuring GPT-4 and Gemini-Pro-Vision's
proficiency in 8th grade mathematics against 56 human populations. We show that
adopting a psychometrics-based approach yields evaluation outcomes that diverge
from those based on existing benchmarking practices. Third, we release 4
high-quality datasets to support measuring and comparing LLM proficiency in
grade school mathematics and science against human populations.
♻ ☆ Resolving Discrepancies in Compute-Optimal Scaling of Language Models
Kaplan et al. and Hoffmann et al. developed influential scaling laws for the
optimal model size as a function of the compute budget, but these laws yield
substantially different predictions. We explain the discrepancy by reproducing
the Kaplan scaling law on two datasets (OpenWebText2 and RefinedWeb) and
identifying three factors causing the difference: last layer computational
cost, warmup duration, and scale-dependent optimizer tuning. With these factors
corrected, we obtain excellent agreement with the Hoffmann et al. (i.e.,
"Chinchilla") scaling law. Counter to a hypothesis of Hoffmann et al., we find
that careful learning rate decay is not essential for the validity of their
scaling law. As a secondary result, we derive scaling laws for the optimal
learning rate and batch size, finding that tuning the AdamW $\beta_2$ parameter
is essential at lower batch sizes.
comment: Fixing bug in small models with tuned LR
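A compute-optimal scaling law of the form N_opt = a * C^b is typically fit by linear regression in log-log space; a generic sketch on synthetic data (the coefficients below are placeholders, not the paper's):

```python
import numpy as np

def fit_power_law(C, N):
    """Fit N = a * C**b by ordinary least squares on log N vs. log C."""
    b, log_a = np.polyfit(np.log(C), np.log(N), 1)
    return np.exp(log_a), b
```

On noiseless synthetic data the exponent is recovered exactly; with real loss measurements, the fit inherits all the sensitivity to warmup, last-layer cost, and optimizer tuning that the abstract describes.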
♻ ☆ The Larger the Better? Improved LLM Code-Generation via Budget Reallocation
It is a common belief that large language models (LLMs) are better than
smaller-sized ones. However, larger models also require significantly more time
and compute during inference. This raises the question: what happens when both
models operate under the same budget (e.g., compute or run-time)? To address
this question, we analyze code generation LLMs of various sizes and make
comparisons such as running a 70B model once vs. generating five outputs from a
13B model. We consider a standard unit-test setup, which can be used to select
the correct output from the smaller model. Our findings reveal that the
repeated use of smaller models can yield consistent improvements, with gains of
up to 15% across five tasks. On the other hand, in scenarios where unit-tests
are unavailable, a ranking-based selection of candidates from the smaller model
falls short of the performance of a single output from larger ones. Our results
highlight the potential of using smaller models instead of larger ones, and the
importance of studying approaches for ranking LLM outputs.
comment: COLM 2024
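The unit-test setup above can be sketched as best-of-k selection: sample k candidate programs from the smaller model and keep the first one that passes all tests (candidate generation is stubbed out here with plain Python callables):

```python
def select_with_unit_tests(candidates, tests):
    """Return the first candidate that passes every unit test; None if all fail.
    Mimics best-of-k output selection from a smaller model."""
    for cand in candidates:
        try:
            if all(test(cand) for test in tests):
                return cand
        except Exception:
            continue  # a crashing candidate simply fails selection
    return None
```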
♻ ☆ AutoRE: Document-Level Relation Extraction with Large Language Models
Large Language Models (LLMs) have demonstrated exceptional abilities in
comprehending and generating text, motivating numerous researchers to utilize
them for Information Extraction (IE) purposes, including Relation Extraction
(RE). Nonetheless, most existing methods are predominantly designed for
Sentence-level Relation Extraction (SentRE) tasks, which typically encompass a
restricted set of relations and triplet facts within a single sentence.
Furthermore, certain approaches resort to treating relations as candidate
choices integrated into prompt templates, leading to inefficient processing and
suboptimal performance when tackling Document-Level Relation Extraction (DocRE)
tasks, which entail handling multiple relations and triplet facts distributed
across a given document, posing distinct challenges. To overcome these
limitations, we introduce AutoRE, an end-to-end DocRE model that adopts a novel
RE extraction paradigm named RHF (Relation-Head-Facts). Unlike existing
approaches, AutoRE does not rely on the assumption of known relation options,
making it more reflective of real-world scenarios. Additionally, we have
developed an easily extensible RE framework using a Parameter-Efficient
Fine-Tuning (PEFT) algorithm (QLoRA). Our experiments on the RE-DocRED dataset
showcase AutoRE's best performance, achieving state-of-the-art results,
surpassing TAG by 10.03\% and 9.03\% on the dev and test sets, respectively. The
code is available at \url{https://github.com/THUDM/AutoRE} and a demonstration
video is provided at https://www.youtube.com/watch?v=IhKRsZUAxKk
comment: 11 pages, 4 figures
♻ ☆ Large Language Models Understand Layout
Large language models (LLMs) demonstrate extraordinary abilities in a wide
range of natural language processing (NLP) tasks. In this paper, we show that,
beyond text understanding capability, LLMs are capable of processing text
layouts that are denoted by spatial markers. They are able to answer questions
that require explicit spatial perceiving and reasoning, while a drastic
performance drop is observed when the spatial markers from the original data
are excluded. We perform a series of experiments with the GPT-3.5, Baichuan2,
Llama2 and ChatGLM3 models on various types of layout-sensitive datasets for
further analysis. The experimental results reveal that the layout understanding
ability of LLMs is mainly introduced by the coding data used for pretraining, which
is further enhanced at the instruction-tuning stage. In addition, layout
understanding can be enhanced by integrating low-cost data auto-generated
via a novel text game. Finally, we show that layout understanding
ability is beneficial for building efficient visual question-answering (VQA)
systems.
♻ ☆ KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache ICML2024
Zirui Liu, Jiayi Yuan, Hongye Jin, Shaochen Zhong, Zhaozhuo Xu, Vladimir Braverman, Beidi Chen, Xia Hu
Efficiently serving large language models (LLMs) requires batching of many
requests to reduce the cost per request. Yet, with larger batch sizes and
longer context lengths, the key-value (KV) cache, which stores attention keys
and values to avoid re-computations, significantly increases memory demands and
becomes the new bottleneck in speed and memory usage. Additionally, the loading
of the KV cache causes the computational core to be idle, which limits the
inference speed. A straightforward and effective solution to reduce KV cache
size is quantization, which decreases the total bytes taken by KV cache.
However, there is a lack of in-depth studies that explore the element
distribution of KV cache to understand the hardness and limitation of KV cache
quantization. To fill the gap, we conducted a comprehensive study on the
element distribution in KV cache of popular LLMs. Our findings indicate that
the key cache should be quantized per-channel, i.e., group elements along the
channel dimension and quantize them together. In contrast, the value cache
should be quantized per-token. From this analysis, we developed a tuning-free
2bit KV cache quantization algorithm named KIVI. With hardware-friendly
implementation, KIVI can enable Llama, Falcon, and Mistral models to maintain
almost the same quality while using $\mathbf{2.6\times}$ less peak memory
(including model weight). This reduction in memory usage enables up to
$\mathbf{4\times}$ larger batch size, bringing $\mathbf{2.35\times \sim
3.47\times}$ throughput on real LLM inference workload. The source code is
available at https://github.com/jy-yuan/KIVI.
comment: ICML2024
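The per-channel vs. per-token distinction above reduces to the axis along which quantization groups are formed. A minimal numpy sketch of asymmetric 2-bit quantization (group sizes, bit packing, and the full-precision residual window of the actual KIVI implementation are omitted):

```python
import numpy as np

def asym_quant(x, axis, bits=2):
    """Asymmetric uniform quantization along one axis of a (tokens, channels)
    cache slice. axis=0 groups along tokens (per-channel, as KIVI uses for
    keys); axis=1 groups along channels (per-token, as for values)."""
    lo = x.min(axis=axis, keepdims=True)
    hi = x.max(axis=axis, keepdims=True)
    scale = (hi - lo) / (2**bits - 1)
    scale = np.where(scale == 0, 1.0, scale)       # guard constant groups
    q = np.round((x - lo) / scale).astype(np.uint8)
    return q, scale, lo

def dequant(q, scale, lo):
    return q * scale + lo
```

Each group stores only 2-bit codes plus one scale and one zero-point, and the reconstruction error per element is at most half a quantization step.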
♻ ☆ Identifying Semantic Induction Heads to Understand In-Context Learning
Although large language models (LLMs) have demonstrated remarkable
performance, the lack of transparency in their inference logic raises concerns
about their trustworthiness. To gain a better understanding of LLMs, we conduct
a detailed analysis of the operations of attention heads and aim to better
understand the in-context learning of LLMs. Specifically, we investigate
whether attention heads encode two types of relationships between tokens
present in natural languages: the syntactic dependency parsed from sentences
and the relation within knowledge graphs. We find that certain attention heads
exhibit a pattern where, when attending to head tokens, they recall tail tokens
and increase the output logits of those tail tokens. More crucially, the
formulation of such semantic induction heads has a close correlation with the
emergence of the in-context learning ability of language models. The study of
semantic attention heads advances our understanding of the intricate operations
of attention heads in transformers, and further provides new insights into the
in-context learning of LLMs.
♻ ☆ SAFETY-J: Evaluating Safety with Critique
The deployment of Large Language Models (LLMs) in content generation raises
significant safety concerns, particularly regarding the transparency and
interpretability of content evaluations. Current methods, primarily focused on
binary safety classifications, lack mechanisms for detailed critique, limiting
their utility for model improvement and user trust. To address these
limitations, we introduce SAFETY-J, a bilingual generative safety evaluator for
English and Chinese with critique-based judgment. SAFETY-J utilizes a robust
training dataset that includes diverse dialogues and augmented query-response
pairs to assess safety across various scenarios comprehensively. We establish
an automated meta-evaluation benchmark that objectively assesses the quality of
critiques with minimal human intervention, facilitating scalable and continuous
improvement. Additionally, SAFETY-J employs an iterative preference learning
technique to dynamically refine safety assessments based on meta-evaluations
and critiques. Our evaluations demonstrate that SAFETY-J provides more nuanced
and accurate safety evaluations, thereby enhancing both critique quality and
predictive reliability in complex content scenarios. To facilitate further
research and application, we open-source SAFETY-J's training protocols,
datasets, and code at \url{https://github.com/GAIR-NLP/Safety-J}.
♻ ☆ Behavioral Testing: Can Large Language Models Implicitly Resolve Ambiguous Entities?
One of the major aspects contributing to the striking performance of large
language models (LLMs) is the vast amount of factual knowledge accumulated
during pre-training. Yet, many LLMs suffer from self-inconsistency, which
raises doubts about their trustworthiness and reliability. In this paper, we
focus on entity type ambiguity and analyze current state-of-the-art LLMs for
their proficiency and consistency in applying their factual knowledge when
prompted for entities under ambiguity. To do so, we propose an evaluation
protocol that disentangles knowing from applying knowledge, and test
state-of-the-art LLMs on 49 entities. Our experiments reveal that LLMs perform
poorly with ambiguous prompts, achieving only 80% accuracy. Our results further
demonstrate systematic discrepancies in LLM behavior and their failure to
consistently apply information, indicating that the models can exhibit
knowledge without being able to utilize it, significant biases for preferred
readings, as well as self-inconsistencies. Our study highlights the importance
of handling entity ambiguity in the future for more trustworthy LLMs.
♻ ☆ Brand Network Booster: A new system for improving brand connectivity
This paper presents a new decision support system offered for an in-depth
analysis of semantic networks, which can provide insights for a better
exploration of a brand's image and the improvement of its connectivity. In
terms of network analysis, we show that this goal is achieved by solving an
extended version of the Maximum Betweenness Improvement problem, which includes
the possibility of considering adversarial nodes, constrained budgets, and
weighted networks - where connectivity improvement can be obtained by adding
links or increasing the weight of existing connections. Our contribution
includes a new algorithmic framework and the integration of this framework into
a software system called Brand Network Booster (BNB), which supports brand
connectivity evaluation and improvement. We present this new system together
with three case studies, and we also discuss its performance. Our tool and
approach are valuable both to network scholars and to marketing and
communication managers across various sectors, public or private, in
facilitating strategic decision-making processes.
♻ ☆ Automatic Textual Normalization for Hate Speech Detection
Social media data is a valuable resource for research, yet it contains a wide
range of non-standard words (NSW). These irregularities hinder the effective
operation of NLP tools. Current state-of-the-art methods for the Vietnamese
language address this issue as a problem of lexical normalization, involving
the creation of manual rules or the implementation of multi-staged deep
learning frameworks, which necessitate extensive efforts to craft intricate
rules. In contrast, our approach is straightforward, employing solely a
sequence-to-sequence (Seq2Seq) model. In this research, we provide a dataset
for textual normalization, comprising 2,181 human-annotated comments with an
inter-annotator agreement of 0.9014. By leveraging the Seq2Seq model for
textual normalization, our results reveal that the accuracy achieved falls
slightly short of 70%. Nevertheless, textual normalization enhances the
accuracy of the Hate Speech Detection (HSD) task by approximately 2%,
demonstrating its potential to improve the performance of complex NLP tasks.
Our dataset is accessible for research purposes.
comment: 2023 International Conference on Intelligent Systems Design and
Applications (ISDA2023)
♻ ☆ LyricWhiz: Robust Multilingual Zero-shot Lyrics Transcription by Whispering to ChatGPT
Le Zhuo, Ruibin Yuan, Jiahao Pan, Yinghao Ma, Yizhi LI, Ge Zhang, Si Liu, Roger Dannenberg, Jie Fu, Chenghua Lin, Emmanouil Benetos, Wei Xue, Yike Guo
We introduce LyricWhiz, a robust, multilingual, and zero-shot automatic
lyrics transcription method achieving state-of-the-art performance on various
lyrics transcription datasets, even in challenging genres such as rock and
metal. Our novel, training-free approach utilizes Whisper, a weakly supervised
robust speech recognition model, and GPT-4, today's most performant chat-based
large language model. In the proposed method, Whisper functions as the "ear" by
transcribing the audio, while GPT-4 serves as the "brain," acting as an
annotator with a strong performance for contextualized output selection and
correction. Our experiments show that LyricWhiz significantly reduces Word
Error Rate compared to existing methods in English and can effectively
transcribe lyrics across multiple languages. Furthermore, we use LyricWhiz to
create the first publicly available, large-scale, multilingual lyrics
transcription dataset with a CC-BY-NC-SA copyright license, based on
MTG-Jamendo, and offer a human-annotated subset for noise level estimation and
evaluation. We anticipate that our proposed method and dataset will advance the
development of multilingual lyrics transcription, a challenging and emerging
task.
comment: 9 pages, 2 figures, 5 tables, accepted by ISMIR 2023
♻ ☆ CIBench: Evaluating Your LLMs with a Code Interpreter Plugin
Songyang Zhang, Chuyu Zhang, Yingfan Hu, Haowen Shen, Kuikun Liu, Zerun Ma, Fengzhe Zhou, Wenwei Zhang, Xuming He, Dahua Lin, Kai Chen
While LLM-Based agents, which use external tools to solve complex problems,
have made significant progress, benchmarking their ability is challenging,
thereby hindering a clear understanding of their limitations. In this paper, we
propose an interactive evaluation framework, named CIBench, to comprehensively
assess LLMs' ability to utilize code interpreters for data science tasks. Our
evaluation framework includes an evaluation dataset and two evaluation modes.
The evaluation dataset is constructed using an LLM-human cooperative approach
and simulates an authentic workflow by leveraging consecutive and interactive
IPython sessions. The two evaluation modes assess LLMs' ability with and
without human assistance. We conduct extensive experiments to analyze the
ability of 24 LLMs on CIBench and provide valuable insights for future LLMs in
code interpreter utilization.
comment: Under review. The first three authors contribute equally, and
Songyang Zhang is the project leader
♻ ☆ The Belebele Benchmark: a Parallel Reading Comprehension Dataset in 122 Language Variants ACL 2024
Lucas Bandarkar, Davis Liang, Benjamin Muller, Mikel Artetxe, Satya Narayan Shukla, Donald Husa, Naman Goyal, Abhinandan Krishnan, Luke Zettlemoyer, Madian Khabsa
We present Belebele, a multiple-choice machine reading comprehension (MRC)
dataset spanning 122 language variants. Significantly expanding the language
coverage of natural language understanding (NLU) benchmarks, this dataset
enables the evaluation of text models in high-, medium-, and low-resource
languages. Each question is based on a short passage from the Flores-200
dataset and has four multiple-choice answers. The questions were carefully
curated to discriminate between models with different levels of general
language comprehension. The English dataset on its own proves difficult enough
to challenge state-of-the-art language models. Being fully parallel, this
dataset enables direct comparison of model performance across all languages. We
use this dataset to evaluate the capabilities of multilingual masked language
models (MLMs) and large language models (LLMs). We present extensive results
and find that despite significant cross-lingual transfer in English-centric
LLMs, much smaller MLMs pretrained on balanced multilingual data still
understand far more languages. We also observe that larger vocabulary size and
conscious vocabulary construction correlate with better performance on
low-resource languages. Overall, Belebele opens up new avenues for evaluating
and analyzing the multilingual capabilities of NLP systems.
comment: ACL 2024
♻ ☆ CCoE: A Compact LLM with Collaboration of Experts
In the domain of Large Language Models (LLMs), LLMs demonstrate significant
capabilities in natural language understanding and generation. With the growing
need to apply LLMs in various domains, a key research question is how to
efficiently train and build a model that has expertise in different domains
at a low training cost. We propose the CCoE architecture, a framework that
easily couples multiple strong domain experts into one large LLM, providing
a collective way of utilizing the different domain expert LLMs. Training a
large collaboration of multiple expert LLMs ordinarily places high demands on
training resources; CCoE bypasses this problem by isolating the other experts
and training each expert separately. The design of CCoE assembles multiple
expert LLMs through CoE (Collaboration of Experts) layers. Each CoE layer can
have one or more expert LLMs. The expert LLMs may have different numbers of
layers and have been well-trained for different domain tasks, and each expert
is fine-tuned to achieve results comparable to SOTA domain LLMs. We start from
5 experts in the domains of Code, Math, Law, text-to-SQL, and Medical. The
results indicate that our CCoE framework can easily and efficiently boost
performance by nearly 10%-20% over the original base model in different
domains, while using fewer resources for training as well as inference.
♻ ☆ Automatically Extracting Numerical Results from Randomized Controlled Trials with Large Language Models
Meta-analyses statistically aggregate the findings of different randomized
controlled trials (RCTs) to assess treatment effectiveness. Because this yields
robust estimates of treatment effectiveness, results from meta-analyses are
considered the strongest form of evidence. However, rigorous evidence syntheses
are time-consuming and labor-intensive, requiring manual extraction of data
from individual trials to be synthesized. Ideally, language technologies would
permit fully automatic meta-analysis, on demand. This requires accurately
extracting numerical results from individual trials, which has been beyond the
capabilities of natural language processing (NLP) models to date. In this work,
we evaluate whether modern large language models (LLMs) can reliably perform
this task. We annotate (and release) a modest but granular evaluation dataset
of clinical trial reports with numerical findings attached to interventions,
comparators, and outcomes. Using this dataset, we evaluate the performance of
seven LLMs applied zero-shot for the task of conditionally extracting numerical
findings from trial reports. We find that massive LLMs that can accommodate
lengthy inputs are tantalizingly close to realizing fully automatic
meta-analysis, especially for dichotomous (binary) outcomes (e.g., mortality).
However, LLMs -- including ones trained on biomedical texts -- perform poorly
when the outcome measures are complex and tallying the results requires
inference. This work charts a path toward fully automatic meta-analysis of RCTs
via LLMs, while also highlighting the limitations of existing models for this
aim.
comment: 25 pages, 7 figures, 6 tables, MLHC 2024
♻ ☆ Towards the Law of Capacity Gap in Distilling Language Models
Language model (LM) distillation is a trending area that aims to distil the
knowledge residing in a large teacher LM to a small student one. While various
methods have been proposed to maximize the effectiveness of the distillation,
significant challenges persist, particularly when there is a substantial
capacity gap between the teacher and student LMs. This issue, often referred to
as the \textit{curse} of capacity gap, suggests that a larger teacher does not
necessarily result in a superior student compared to one distilled from a
smaller teacher. In other words, there is likely an optimal teacher yielding
the best student along the scaling course of the teacher. However, the curse of
capacity gap cannot be tackled without notable compute overhead, as indicated
in previous studies. In the context of large LMs (LLMs), previously viable
approaches become much less meaningful, as it becomes an impossible triangle to
distill an expected student from an optimal teacher with small compute
overhead. Fortunately, the impossible triangle becomes possible once a
\textit{law} of capacity gap is induced. In this paper, we take the spirit of
scaling laws and reveal that the optimal teacher scale almost consistently
follows a linear scaling with the student scale across different
model architectures and data scales. The law then guides us to distil a 3B
student LM (termed \textsc{MiniMA}) from LLaMA2-7B. \textsc{MiniMA} is
demonstrated to outperform a wide range of 3B competitors and could even
compete with several 7B models.
comment: 32 pages, 10 figures, 15 tables, work in progress. Code and
checkpoints are available at https://github.com/GeneZC/MiniMA
♻ ☆ Adapting Large Language Models to Domains via Reading Comprehension ICLR 2024
We explore how continued pre-training on domain-specific corpora influences
large language models, revealing that training on the raw corpora endows the
model with domain knowledge, but drastically hurts its prompting ability for
question answering. Taking inspiration from human learning via reading
comprehension--practice after reading improves the ability to answer questions
based on the learned knowledge--we propose a simple method for transforming raw
corpora into reading comprehension texts. Each raw text is enriched with a
series of tasks related to its content. Our method, highly scalable and
applicable to any pre-training corpora, consistently enhances performance
across various tasks in three different domains: biomedicine, finance, and law.
Notably, our 7B language model achieves competitive performance with
domain-specific models of much larger scales, such as BloombergGPT-50B.
Furthermore, we demonstrate that domain-specific reading comprehension texts
can improve the model's performance even on general benchmarks, showing the
potential to develop a general model across even more domains. Our model, code,
and data are available at https://github.com/microsoft/LMOps.
comment: ICLR 2024 Conference
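A toy version of the raw-corpus-to-reading-comprehension transformation might look like the following; the single definition-mining template is a hypothetical stand-in for the paper's richer set of mined task templates:

```python
import re

def to_reading_comprehension(raw_text):
    """Append simple comprehension tasks derived from the raw text.

    Illustrative sketch only: mine definition-style sentences
    ("X is a/an Y.") into question-answer pairs and append them,
    so the model practices QA on the knowledge it just read.
    """
    tasks = []
    for sent in re.split(r"(?<=[.!?])\s+", raw_text.strip()):
        m = re.match(r"([A-Z][\w\s-]+?) is (an? [^.]+)\.", sent)
        if m:
            tasks.append(
                f"Question: What is {m.group(1)}?\nAnswer: {m.group(2)}."
            )
    return raw_text + "\n\n" + "\n\n".join(tasks)
```

The key design point is that the transformation is purely rule-based over the pre-training corpus, so it scales to arbitrarily large domain corpora without annotation.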
♻ ☆ Chain-of-Layer: Iteratively Prompting Large Language Models for Taxonomy Induction from Limited Examples
Automatic taxonomy induction is crucial for web search, recommendation
systems, and question answering. Manual curation of taxonomies is expensive in
terms of human effort, making automatic taxonomy construction highly desirable.
In this work, we introduce Chain-of-Layer, an in-context learning
framework designed to induce taxonomies from a given set of entities.
Chain-of-Layer breaks down the task into selecting relevant candidate entities
in each layer and gradually building the taxonomy from top to bottom. To
minimize errors, we introduce the Ensemble-based Ranking Filter to reduce the
hallucinated content generated at each iteration. Through extensive
experiments, we demonstrate that Chain-of-Layer achieves state-of-the-art
performance on four real-world benchmarks.
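The Ensemble-based Ranking Filter can be approximated by a simple vote count over several sampled generations; this minimal sketch (the edge representation and vote threshold are assumptions) keeps only parent-child edges that recur across samples:

```python
from collections import Counter

def ensemble_filter(sampled_edge_lists, min_votes=2):
    """Prune likely-hallucinated taxonomy edges by ensemble voting.

    Each element of `sampled_edge_lists` is one LLM generation's list
    of (parent, child) edges; edges proposed by fewer than `min_votes`
    generations are discarded. A toy stand-in for the paper's
    Ensemble-based Ranking Filter.
    """
    counts = Counter(
        edge for edges in sampled_edge_lists for edge in set(edges)
    )
    return {edge for edge, c in counts.items() if c >= min_votes}
```

Edges invented by a single unlucky sample rarely reappear in other samples, which is what makes agreement a useful hallucination signal at each layer of the top-down construction.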
♻ ☆ JailbreakZoo: Survey, Landscapes, and Horizons in Jailbreaking Large Language and Vision-Language Models
The rapid evolution of artificial intelligence (AI) through developments in
Large Language Models (LLMs) and Vision-Language Models (VLMs) has brought
significant advancements across various technological domains. While these
models enhance capabilities in natural language processing and visual
interactive tasks, their growing adoption raises critical concerns regarding
security and ethical alignment. This survey provides an extensive review of the
emerging field of jailbreaking--deliberately circumventing the ethical and
operational boundaries of LLMs and VLMs--and the consequent development of
defense mechanisms. Our study categorizes jailbreaks into seven distinct types
and elaborates on defense strategies that address these vulnerabilities.
Through this comprehensive examination, we identify research gaps and propose
directions for future studies to enhance the security frameworks of LLMs and
VLMs. Our findings underscore the necessity for a unified perspective that
integrates both jailbreak strategies and defensive solutions to foster a
robust, secure, and reliable environment for the next generation of language
models. More details can be found on our website:
\url{https://chonghan-chen.com/llm-jailbreak-zoo-survey/}.
comment: 45 pages
♻ ☆ Exploring Semantic Perturbations on Grover
With news and information as easy to access as they currently are, it is
more important than ever to ensure that people are not misled by what they
read. Recently, the rise of neural fake news (AI-generated fake news) and its
demonstrated effectiveness at fooling humans has prompted the development of
models to detect it. One such model is Grover, which can both detect neural
fake news and generate it, demonstrating how such a model could be misused
to fool human readers. In this work, we explore the Grover
model's fake news detection capabilities by performing targeted attacks through
perturbations on input news articles. Through this we test Grover's resilience
to these adversarial attacks and expose some potential vulnerabilities which
should be addressed in further iterations to ensure it can detect all types of
fake news accurately.
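A minimal sketch of such a targeted perturbation, assuming a simple word-substitution attack rather than the specific semantic perturbations studied in the paper:

```python
def perturb(article, substitutions):
    """Apply targeted word substitutions to an input article.

    Illustrative stand-in for a semantic perturbation attack: swap
    selected words, then re-run the detector on the result to check
    whether its human/machine verdict flips. `substitutions` maps
    original words to their replacements.
    """
    for old, new in substitutions.items():
        article = article.replace(old, new)
    return article
```

In an actual evaluation, each perturbed article would be scored by the detector and compared against the score of the original to measure how much the verdict shifts.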